Abstract

In data analysis, new forms of complex data have to be considered, for example symbolic
data, functional data, web data, trees, SQL queries and multimedia data. In this context,
classical data analysis for knowledge discovery based on calculating the center of gravity
cannot be used because the inputs are not $\mathbb{R}^p$ vectors. In this paper, we present
an application on real-world symbolic data using the self-organizing map. To this end, we
propose an extension of the self-organizing map that can handle symbolic data.

1 Introduction

The self-organizing map (SOM) introduced by Kohonen [H] is an unsupervised neural network
method which has both clustering and visualization properties. It can be considered as an
algorithm that maps a high-dimensional data space, $\mathbb{R}^p$, to a space of lower
dimension, generally 2, which is called a map. This projection enables the input data to be
partitioned into "similar" clusters while preserving their topology. Its most similar
predecessors are the k-means algorithm [7] and the dynamic clustering method, which operate
as a SOM without topology preservation and therefore without easy visualization. In data
analysis, new forms of complex data have to be considered, most notably symbolic data (data
with an internal structure such as interval data, distributions, functional data, etc.) and
semi-structured data (trees, XML documents, SQL queries, etc.). In this context, classical
data analysis based on calculating the center of gravity cannot be used because the inputs
are not $\mathbb{R}^p$ vectors. In order to solve this problem, several methods can be
considered depending on the type of data (for example projection operators for functional
data [8]). However, those methods are not fully general and an adaptation of every data
analysis algorithm to the resulting data is needed.

Kohonen's SOM is based on the notion of center of gravity and, unfortunately, this
concept is not applicable to many kinds of complex data. In this paper we propose an
adaptation of the SOM to dissimilarity data as an alternative solution. Our goal is to
modify the SOM algorithm so that it operates on dissimilarity measures rather than on raw
data. To this end, we take inspiration from the work of Kohonen [5]. To apply the method,
only the definition of a dissimilarity for each type of data is necessary, and so complex
data can be processed.

2 Batch self-organizing map for dissimilarity data 

The SOM can be considered as carrying out vector quantization and/or clustering while 
preserving the spatial ordering of the prototype vectors (also called referent vectors) in 
one or two dimensional output space. The SOM consists of neurons organized on a regular 
low-dimensional map. More formally, the map is described by a graph (C,T). C is a set 
of m interconnected neurons having a discrete topology defined by T. 

For each pair of neurons (c, r) on the map, the distance $\delta(c, r)$ is defined as the
length of the shortest path between c and r on the graph. This distance imposes a
neighborhood relation between neurons. The batch training algorithm is an iterative
algorithm in which the whole data set (denoted Q) is presented to the map before any
adjustments are made. We denote by $z_i$ an element of Q and by $\mathbf{z}_i$ the
representation of this element in the space D, called the representation space of Q. In our
case, the main difference with the classical batch algorithm is that the representation
space is not $\mathbb{R}^p$ but an arbitrary set on which a dissimilarity (denoted d) is
defined.

Each neuron c is represented by a set $A_c = \{z_1, \ldots, z_q\}$ of elements of Q with a
fixed cardinality q, where each $z_i$ belongs to Q. $A_c$ is called an individual referent.
We denote by A the set of all individual referents, i.e. the list $A = (A_1, \ldots, A_m)$.
In our approach each neuron has a finite number of representations. We define a new
adequacy function $d^T$ from $Q \times P(Q)$ to $\mathbb{R}^+$ by:

$$d^T(z_i, A_c) = \sum_{r \in C} K^T(\delta(c, r)) \sum_{z_j \in A_r} d^2(z_i, z_j)$$

$d^T$ is based on the positive kernel function K. $K^T(\delta(c, r))$ is the neighborhood
kernel around the neuron r, where $\delta(c, r)$ is the graph distance between the neurons c
and r. The function K is such that $\lim_{|\delta| \to \infty} K(\delta) = 0$ and allows us
to transform the sharp graph distance between two neurons on the map into a smooth
distance. K is used to define a family of functions $K^T$ parameterized by T, with
$K^T(\delta) = K(\delta / T)$. T is used to control the size of the neighborhood [T]: when
the parameter T is small, there are few neurons in the neighborhood. A simple example of
$K^T$ is defined by $K^T(\delta) = e^{-\delta^2 / T^2}$.
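As an illustration of how T shapes the neighborhood, the following Python sketch evaluates the kernel on a 10 x 3 map. It assumes a rectangular grid with 4-connectivity, so that the graph distance reduces to the Manhattan distance between grid coordinates, and it assumes the kernel form exp(-delta^2 / T^2). The function names `grid_distance` and `kernel` and the two values of T are chosen for the example only and are not taken from the paper.

```python
import math

def grid_distance(c, r, n_cols):
    """Shortest-path length between neurons c and r on a rectangular grid.

    Assumes 4-connectivity and row-major neuron labels, so the graph
    distance equals the Manhattan distance between grid coordinates.
    """
    row_c, col_c = divmod(c, n_cols)
    row_r, col_r = divmod(r, n_cols)
    return abs(row_c - row_r) + abs(col_c - col_r)

def kernel(delta, T):
    """Neighborhood kernel K^T(delta) = exp(-delta^2 / T^2) (assumed form)."""
    return math.exp(-(delta ** 2) / (T ** 2))

# Kernel weights seen from neuron 0 on a map of 10 rows and 3 columns.
weights_wide = [kernel(grid_distance(0, r, 3), T=3.0) for r in range(30)]
weights_narrow = [kernel(grid_distance(0, r, 3), T=0.5) for r in range(30)]
```

With T = 0.5 only neuron 0 itself keeps a weight above 0.5, whereas with T = 3 the six neurons within graph distance 2 of it do.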

During the learning, we minimize a cost function E by alternating an assignment step
and a representation step. During the assignment step, the assignment function f assigns
each individual $z_i$ to the nearest neuron, here in terms of the function $d^T$:

$$f(z_i) = \arg\min_{c \in C} d^T(z_i, A_c)$$

In case of a tie, we assign the individual to the neuron with the smallest label.
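The assignment step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dissimilarities are stored in a matrix `d` (list of lists), each individual referent is a list of item indices, and `kernel_weights[c][r]` holds the precomputed value of the neighborhood kernel for the neuron pair (c, r). All names and the toy data are hypothetical.

```python
def adequacy(i, c, referents, d, kernel_weights):
    """d^T(z_i, A_c): kernel-weighted sum over all neurons r of the squared
    dissimilarities between item i and the elements of the referent A_r."""
    return sum(
        kernel_weights[c][r] * sum(d[i][j] ** 2 for j in referents[r])
        for r in range(len(referents))
    )

def assign(n_items, referents, d, kernel_weights):
    """Assign every item to the neuron with the smallest adequacy value.

    min() returns the first minimizer it meets, so ties are resolved in
    favor of the neuron with the smallest label.
    """
    m = len(referents)
    return [
        min(range(m), key=lambda c: adequacy(i, c, referents, d, kernel_weights))
        for i in range(n_items)
    ]

# Toy data: four points on a line, two neurons whose referents are items 0 and 2.
x = [0, 1, 10, 11]
d = [[abs(a - b) for b in x] for a in x]
labels = assign(4, [[0], [2]], d, [[1.0, 0.1], [0.1, 1.0]])
```

On this toy example the first two points go to neuron 0 and the last two to neuron 1.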
During the representation step, we have to find the new individual referents $A^*$ that
represent the set of observations in the best way in terms of the following cost function E:

$$E(f, A) = \sum_{z_i \in Q} d^T(z_i, A_{f(z_i)}) = \sum_{z_i \in Q} \sum_{r \in C} K^T(\delta(f(z_i), r)) \sum_{z_j \in A_r} d^2(z_i, z_j)$$

This function measures the adequacy between the partition induced by the assignment
function and the map referents A.

The criterion E is additive, so this optimization step can be carried out independently
for each neuron. Indeed, we minimize the following m functions:

$$E_r = \sum_{z_i \in Q} K^T(\delta(f(z_i), r)) \sum_{z_j \in A_r} d^2(z_i, z_j)$$

In the classical batch version, this minimization of the function E is immediate because
the positions of the referent vectors are the averages of the data samples weighted by the
kernel function.
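When only dissimilarities are available, no average can be computed, and the representation step has to search among the elements of Q instead. The sketch below shows one possible way to do this in Python; it is not the authors' implementation. It relies on the observation that each per-neuron criterion is a sum of one cost per element of the referent, so keeping the q cheapest candidates minimizes it when the referent consists of q distinct elements. The grid topology (Manhattan distance), the kernel form exp(-delta^2 / T^2), the geometric decrease of T and the deterministic initialization are assumptions made for the example.

```python
import math

def batch_dissimilarity_som(d, grid, q=1, n_iter=10, T_start=3.0, T_end=0.5):
    """Batch SOM on an n x n dissimilarity matrix d (list of lists).

    grid lists the (row, col) position of each neuron; graph distances are
    taken as Manhattan distances on that grid (an assumed topology).
    Requires n_iter >= 1 and n >= q * len(grid).
    """
    n, m = len(d), len(grid)
    delta = [[abs(a[0] - b[0]) + abs(a[1] - b[1]) for b in grid] for a in grid]
    d2 = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    # Assumed initialization: neuron c starts with items c*q .. c*q + q - 1.
    referents = [list(range(c * q, (c + 1) * q)) for c in range(m)]

    def assign(K):
        # s[i][r] = sum over z_j in A_r of d^2(z_i, z_j)
        s = [[sum(d2[i][j] for j in referents[r]) for r in range(m)]
             for i in range(n)]
        # min() keeps the first minimizer: smallest neuron label on ties.
        return [min(range(m),
                    key=lambda c: sum(K[c][r] * s[i][r] for r in range(m)))
                for i in range(n)]

    for it in range(n_iter):
        # Assumed schedule: T decreases geometrically from T_start to T_end.
        T = T_start * (T_end / T_start) ** (it / max(n_iter - 1, 1))
        K = [[math.exp(-(delta[c][r] ** 2) / T ** 2) for r in range(m)]
             for c in range(m)]
        f = assign(K)
        for r in range(m):
            # cost[j] = sum_i K^T(delta(f(z_i), r)) * d^2(z_i, z_j)
            cost = [sum(K[f[i]][r] * d2[i][j] for i in range(n))
                    for j in range(n)]
            # E_r is a sum of per-candidate costs: keep the q cheapest items.
            referents[r] = sorted(range(n), key=lambda j: cost[j])[:q]
    return assign(K), referents

# Toy data: two groups of points on a line, a map of two neurons.
x = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in x] for a in x]
f, referents = batch_dissimilarity_som(
    d, grid=[(0, 0), (0, 1)], q=1, n_iter=5, T_start=1.0, T_end=0.3)
```

On this toy data the two neurons end up represented by the middle point of each group. The exhaustive search over candidates costs on the order of n^2 operations per neuron and per iteration, which is acceptable for a few hundred items such as the stations considered in the experiments.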

3 Experiments 

To evaluate our method, we consider real-world interval data. Our adaptation of the
SOM to dissimilarity data can be applied directly to this kind of interval-structured data,
once a dissimilarity can be associated with these data. This application concerns monthly
minimal and maximal temperatures observed in 265 meteorological stations in China. A natural
representation of the monthly temperature recorded by a station is the interval constituted
by the mean of the daily minimal and the mean of the daily maximal temperatures observed
at this station over a month. Table 1 shows the temperatures recorded by the 265 stations
over a 10-year period (between 1979 and 1988). Each interval is the mean of the minimal
and the mean of the maximal monthly temperatures for these 10 years.



Station    January          February         ...   November        December
Abag Qi    [-24.9; -17]     [-22.3; -12.8]   ...   [-16.4; -6.2]   [-24.7; -14.8]
Hailaer    [-28.6; -22.5]   [-25.5; -19.7]   ...   [-17.4; -9.3]   [-25.5; -20.0]

Table 1: Temperatures of the 265 Chinese stations between 1979 and 1988



We will now describe the parameters used for this application (dissimilarity, map
dimensions, number of iterations, ...). The choice of these parameters is important for the
algorithm. We will then describe the results obtained. We use factorial dissimilarity
analysis (for more details, see [9], [I]) to visualize the maps.
3.1 Hausdorff distance



First, we choose to work with the Hausdorff-type L2-distance on interval data, defined as
follows:

$$d(Q, Q') = \sqrt{\sum_{j=1}^{p} \left( \max\{ |a_j - a'_j|, |b_j - b'_j| \} \right)^2}$$

with $Q = (I_1, \ldots, I_p)$ and $Q' = (I'_1, \ldots, I'_p)$ a pair of items described by p
intervals and $I_j = [a_j, b_j]$. It combines the p one-dimensional, coordinate-wise
Hausdorff distances in a way which is similar to the definition of the Euclidean distance in
$\mathbb{R}^p$. The map dimension is m = 30 neurons (10 x 3). We use the elements of the
data set in a random order to initialize the map and to choose the initial individual
referents A. The cardinality q of the individual referents is fixed to 1.
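A small Python sketch of the Hausdorff-type L2-distance may help to fix ideas. It represents an item as a list of (lower, upper) pairs, one per interval; the function name `hausdorff_l2` is hypothetical. For brevity the example uses only the January and February intervals of the two stations listed in Table 1, whereas the actual items have one interval per month.

```python
import math

def hausdorff_l2(Q, Qp):
    """Hausdorff-type L2-distance between two items described by p intervals.

    Q and Qp are sequences of (a_j, b_j) pairs of equal length p. For each
    coordinate the one-dimensional Hausdorff distance between intervals is
    max(|a_j - a'_j|, |b_j - b'_j|); the p values are combined as in the
    Euclidean norm.
    """
    return math.sqrt(sum(
        max(abs(a - ap), abs(b - bp)) ** 2
        for (a, b), (ap, bp) in zip(Q, Qp)
    ))

# January and February intervals taken from Table 1.
abag_qi = [(-24.9, -17.0), (-22.3, -12.8)]
hailaer = [(-28.6, -22.5), (-25.5, -19.7)]
dist = hausdorff_l2(abag_qi, hailaer)
```

Here the coordinate-wise Hausdorff distances are 5.5 (January) and 6.9 (February), so the result is the square root of 5.5^2 + 6.9^2.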

Figure 1 shows the initial map on the factorial dissimilarity analysis planes.

Figure 1: Initial map and the data on the factorial dissimilarity analysis plane



Figure 2 shows the projection of the final map on the training data in the factorial
planes.

Figure 2: Final map and the data on the factorial dissimilarity analysis planes. Each color
represents a cluster



The details of the result, shown in Figure 3, provide a clear representation of all the
stations distributed over 30 clusters. Displayed on the geographical map of China, the
resulting clusters show the stations attached to their referent station. The clusters on
the right of Figure 2 are cold stations and correspond to the north and west of China. The
warm stations are on the left of Figure 2; they correspond to the south and south-east of
China and are characterized by very large variations in temperature. There is a continuity
from cold stations to warm and hot ones. The analysis of the distribution of the clusters
on the geographical map of China suggests that the variations in temperature depend more on
latitude than on longitude.

Figure 3: Distribution of the clusters on the geographical map of China, using the same
colors as in Figure 2

3.2 Euclidean distance 

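Secondly, we use the Euclidean distance on interval data, defined as follows:

$$d(Q, Q') = \frac{1}{4} \left\| (a - a') + (b - b') \right\|^2$$

with $Q = (I_1, \ldots, I_p)$ and $Q' = (I'_1, \ldots, I'_p)$ a pair of items described by p
intervals, $I_j = [a_j, b_j]$, and a, b (respectively a', b') the vectors of lower and upper
bounds of Q (respectively Q').

We use the same parameters as for the Hausdorff distance.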
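Reading the formula as one quarter of the squared norm of (a - a') + (b - b'), this dissimilarity is the squared Euclidean distance between the interval midpoints, since ((a - a') + (b - b')) / 2 = (a + b) / 2 - (a' + b') / 2. The short Python sketch below checks this on a made-up pair of items; the function names and the data are hypothetical.

```python
def euclidean_interval(Q, Qp):
    """(1/4) * || (a - a') + (b - b') ||^2 for items given as (a_j, b_j) pairs."""
    return 0.25 * sum(
        ((a - ap) + (b - bp)) ** 2
        for (a, b), (ap, bp) in zip(Q, Qp)
    )

def squared_midpoint_distance(Q, Qp):
    """Squared Euclidean distance between the vectors of interval midpoints."""
    return sum(
        ((a + b) / 2 - (ap + bp) / 2) ** 2
        for (a, b), (ap, bp) in zip(Q, Qp)
    )

# Made-up items with p = 2 intervals each.
Q = [(0.0, 2.0), (1.0, 3.0)]
Qp = [(1.0, 5.0), (1.0, 7.0)]
value = euclidean_interval(Q, Qp)
```

Because only the midpoints enter the formula, two items whose intervals share the same centers but differ in width are at distance zero, which is one way in which this dissimilarity differs from the Hausdorff-type distance.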




Figure 4 shows the projection of the final map on the factorial dissimilarity analysis
plane. Figure 5 provides the details of the result.

Figure 4: Final map on the factorial dissimilarity analysis plane. Each color represents a
cluster




The classification is meaningful but different from the one with the Hausdorff distance. 
We will now compare the different results. 

3.3 Discussion 

We also used some other metrics for this application, but the results cannot be detailed
here for lack of space. We used the vertex-type distance, defined as the sum of the squared
Euclidean distances between the $2^p$ vertices. We also used the mean temperatures of the
stations (a non-symbolic representation of the temperatures). In order to compare the
results obtained with these different metrics, we calculate the longitude and latitude
distortions of the resulting clusterings. The distortion is defined as the quadratic mean
error between the referents and their assigned individuals.
The longitudinal distortion is defined as follows:

$$(D_{long})^2 = \sum_{c \in C} \sum_{z_i \in c} \frac{1}{|c|} \left| Lo_{z_i} - Lo_{f(z_i)} \right|^2$$

with |c| the cardinality of the cluster c and $|Lo_{z_i} - Lo_{f(z_i)}|$ the longitude
distance between the station $z_i$ and its referent $f(z_i)$.

The latitude distortion is defined as follows:

$$(D_{lati})^2 = \sum_{c \in C} \sum_{z_i \in c} \frac{1}{|c|} \left| La_{z_i} - La_{f(z_i)} \right|^2$$

with $|La_{z_i} - La_{f(z_i)}|$ the latitude distance between the station $z_i$ and its
referent $f(z_i)$.
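The distortion can be computed as in the following Python sketch. It assumes the reading of the definition in which the squared deviations to the referent station are averaged within each cluster and then summed over clusters, and it returns the square root of that sum. The function name, the argument layout and the coordinates are hypothetical and do not come from the station data.

```python
import math

def distortion(coord, f, referent_station):
    """Distortion of a clustering along one geographical coordinate.

    coord[i]            longitude (or latitude) of station i
    f[i]                cluster to which station i is assigned
    referent_station[c] index of the referent station of cluster c
    """
    sizes = {}
    for c in f:
        sizes[c] = sizes.get(c, 0) + 1
    total = sum(
        (coord[i] - coord[referent_station[c]]) ** 2 / sizes[c]
        for i, c in enumerate(f)
    )
    return math.sqrt(total)

# Made-up coordinates: two clusters of two stations each.
coord = [0.0, 2.0, 10.0, 14.0]
f = [0, 0, 1, 1]
referent_station = {0: 0, 1: 2}
value = distortion(coord, f, referent_station)
```

In this toy case the within-cluster mean squared deviations are 2 and 8, so the distortion is the square root of 10.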

Table 2 reports the longitude and latitude distortions of the clusterings obtained with
the different metrics.



Data type          Used metric            Longitude distortion   Latitude distortion
Intervals          Euclidean distance     9.250688               1.993213
Intervals          Vertex-type distance   8.625175               2.165838
Means (numerics)   Euclidean distance     7.656033               1.936692
Intervals          Hausdorff distance     7.38314                1.911461

Table 2: The different longitude and latitude distortions for the different metrics

We can deduce that the clustering obtained with the Hausdorff distance induced the 
smallest latitude and longitude distortions. 



4 Conclusion 

In this paper, we proposed an adaptation of the self-organizing map to dissimilarity data.
This adaptation is based on the batch algorithm and can handle both numerical data and
complex data. The experiments showed the usefulness of the method and that it can be
applied to symbolic data or other complex data once a dissimilarity can be defined for
these data.